Deep Reinforcement Learning
   HOME

TheInfoList



OR:

Deep reinforcement learning (deep RL) is a subfield of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
that combines
reinforcement learning Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine ...
(RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the
state space A state space is the set of all possible configurations of a system. It is a useful abstraction for reasoning about the behavior of a given system and is widely used in the fields of artificial intelligence and game theory. For instance, the to ...
. Deep RL algorithms are able to take in very large inputs (e.g. every pixel rendered to the screen in a video game) and decide what actions to perform to optimize an objective (e.g. maximizing the game score). Deep reinforcement learning has been used for a diverse set of applications including but not limited to
robotics Robotics is an interdisciplinary branch of computer science and engineering. Robotics involves design, construction, operation, and use of robots. The goal of robotics is to design machines that can help and assist humans. Robotics integrate ...
,
video game Video games, also known as computer games, are electronic games that involves interaction with a user interface or input device such as a joystick, controller, keyboard, or motion sensing device to generate visual feedback. This fee ...
s, natural language processing, computer vision, education, transportation, finance and healthcare.


Overview


Deep learning

Deep learning is a form of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
that utilizes a neural network to transform a set of inputs into a set of outputs via an
artificial neural network Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected unit ...
. Deep learning methods, often using
supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...
with labeled datasets, have been shown to solve tasks that involve handling complex, high-dimensional raw input data such as images, with less manual
feature engineering Feature engineering or feature extraction or feature discovery is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The motivation is to use these extra features to improve the qua ...
than prior methods, enabling significant progress in several fields including computer vision and natural language processing.


Reinforcement learning

Reinforcement learning Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine ...
is a process in which an agent learns to make decisions through trial and error. This problem is often modeled mathematically as a Markov decision process (MDP), where an agent at every timestep is in a state s, takes action a, receives a scalar reward and transitions to the next state s' according to environment dynamics p(s', s, a). The agent attempts to learn a policy \pi(a, s), or map from observations to actions, in order to maximize its returns (expected sum of rewards). In reinforcement learning (as opposed to optimal control) the algorithm only has access to the dynamics p(s', s, a) through sampling.


Deep reinforcement learning

In many practical decision-making problems, the states s of the MDP are high-dimensional (eg. images from a camera or the raw sensor stream from a robot) and cannot be solved by traditional RL algorithms. Deep reinforcement learning algorithms incorporate deep learning to solve such Maps a, often representing the policy \pi(a, s) or other learned functions as a neural network, and developing specialized algorithms that perform well in this setting.


History

Along with rising interest in neural networks beginning in the mid 1980s, interest grew in deep reinforcement learning, where a neural network is used in reinforcement learning to represent policies or value functions. Because in such a system, the entire decision making process from sensors to motors in a robot or agent involves a single neural network, it is also sometimes called end-to-end reinforcement learning. One of the first successful applications of reinforcement learning with neural networks was
TD-Gammon TD-Gammon is a computer backgammon program developed in 1992 by Gerald Tesauro at IBM's Thomas J. Watson Research Center. Its name comes from the fact that it is an artificial neural net trained by a form of temporal-difference learning, specif ...
, a computer program developed in 1992 for playing
backgammon Backgammon is a two-player board game played with counters and dice on tables boards. It is the most widespread Western member of the large family of tables games, whose ancestors date back nearly 5,000 years to the regions of Mesopotamia and Pe ...
. Four inputs were used for the number of pieces of a given color at a given location on the board, totaling 198 input signals. With zero knowledge built in, the network learned to play the game at an intermediate level by self-play and TD(\lambda). Seminal textbooks by
Sutton Sutton (''south settlement'' or ''south town'' in Old English) may refer to: Places United Kingdom England In alphabetical order by county: * Sutton, Bedfordshire * Sutton, Berkshire, a location * Sutton-in-the-Isle, Ely, Cambridgeshire * ...
and Barto on reinforcement learning, Bertsekas and Tsitiklis on neuro-dynamic programming, and others advanced knowledge and interest in the field. Katsunari Shibata's group showed that various functions emerge in this framework, including image recognition, color constancy, sensor motion (active recognition), hand-eye coordination and hand reaching movement, explanation of brain activities, knowledge transfer, memory, selective attention, prediction, and exploration. Starting around 2012, the so called Deep learning revolution led to an increased interest in using deep neural networks as function approximators across a variety of domains. This led to a renewed interest in researchers using deep neural networks to learn the policy, value, and/or Q functions present in existing reinforcement learning algorithms. Beginning around 2013, DeepMind showed impressive learning results using deep RL to play Atari video games. The computer player a neural network trained using a deep RL algorithm, a deep version of
Q-learning ''Q''-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions an ...
they termed deep Q-networks (DQN), with the game score as the reward. They used a deep
convolutional neural network In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Netwo ...
to process 4 frames RGB pixels (84x84) as inputs. All 49 games were learned using the same network architecture and with minimal prior knowledge, outperforming competing methods on almost all the games and performing at a level comparable or superior to a professional human game tester. Deep reinforcement learning reached another milestone in 2015 when
AlphaGo AlphaGo is a computer program that plays the board game Go. It was developed by DeepMind Technologies a subsidiary of Google (now Alphabet Inc.). Subsequent versions of AlphaGo became increasingly powerful, including a version that competed u ...
, a computer program trained with deep RL to play Go, became the first computer Go program to beat a human professional Go player without handicap on a full-sized 19×19 board. In a subsequent project in 2017,
AlphaZero AlphaZero is a computer program developed by artificial intelligence research company DeepMind to master the games of chess, shogi and go. This algorithm uses an approach similar to AlphaGo Zero. On December 5, 2017, the DeepMind team re ...
improved performance on Go while also demonstrating they could use the same algorithm to learn to play
chess Chess is a board game for two players, called White and Black, each controlling an army of chess pieces in their color, with the objective to checkmate the opponent's king. It is sometimes called international chess or Western chess to dist ...
and shogi at a level competitive or superior to existing computer programs for those games, and again improved in 2019 with
MuZero MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing their rules. Its release in 2019 included benchmarks of its performance in go, chess, shogi, and a standard suite of Atar ...
. Separately, another milestone was achieved by researchers from Carnegie Mellon University in 2019 developing
Pluribus The Pluribus''Pluribus'' is the ablative plural of the Latin word for "more" or "above." multiprocessor was an early multi-processor computer designed by BBN for use as a packet switch in the ARPANET. Its design later influenced the BBN Butterf ...
, a computer program to play poker that was the first to beat professionals at multiplayer games of no-limit
Texas hold 'em Texas hold 'em (also known as Texas holdem, hold 'em, and holdem) is one of the most popular variants of the card game of poker. Two cards, known as hole cards, are dealt face down to each player, and then five community cards are dealt fa ...
.
OpenAI Five OpenAI Five is a computer program by OpenAI that plays the five-on-five video game ''Dota 2''. Its first public appearance occurred in 2017, where it was demonstrated in a live one-on-one game against the professional player, Dendi, who lost to i ...
, a program for playing five-on-five Dota 2 beat the previous world champions in a demonstration match in 2019. Deep reinforcement learning has also been applied to many domains beyond games. In robotics, it has been used to let robots perform simple household tasks and solve a Rubik's cube with a robot hand. Deep RL has also found sustainability applications, used to reduce energy consumption at data centers. Deep RL for
autonomous driving A self-driving car, also known as an autonomous car, driver-less car, or robotic car (robo-car), is a car that is capable of traveling without human input.Xie, S.; Hu, J.; Bhowmick, P.; Ding, Z.; Arvin, F.,Distributed Motion Planning for Sa ...
is an active area of research in academia and industry. Loon explored deep RL for autonomously navigating their high-altitude balloons.


Algorithms

Various techniques exist to train policies to solve tasks with deep reinforcement learning algorithms, each having their own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics. In model-based deep reinforcement learning algorithms, a forward model of the environment dynamics is estimated, usually by
supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...
using a neural network. Then, actions are obtained by using
model predictive control Model predictive control (MPC) is an advanced method of process control that is used to control a process while satisfying a set of constraints. It has been in use in the process industries in chemical plants and oil refineries since the 1980s. In ...
using the learned model. Since the true environment dynamics will usually diverge from the learned dynamics, the agent re-plans often when carrying out actions in the environment. The actions selected may be optimized using
Monte Carlo methods Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be determini ...
such as the
cross-entropy method The cross-entropy (CE) method is a Monte Carlo method for importance sampling and optimization. It is applicable to both combinatorial and continuous problems, with either a static or noisy objective. The method approximates the optimal importance ...
, or a combination of model-learning with model-free methods. In model-free deep reinforcement learning algorithms, a policy \pi(a, s) is learned without explicitly modeling the forward dynamics. A policy can be optimized to maximize returns by directly estimating the policy gradient but suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent algorithms have been developed for more stable learning and widely applied. Another class of model-free deep reinforcement learning algorithms rely on
dynamic programming Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. ...
, inspired by
temporal difference learning Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, like Monte Carlo methods, a ...
and
Q-learning ''Q''-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions an ...
. In discrete action spaces, these algorithms usually learn a neural network Q-function Q(s, a) that estimates the future returns taking action a from state s. In continuous spaces, these algorithms often learn both a value estimate and a policy.


Research

Deep reinforcement learning is an active area of research, with several lines of inquiry.


Exploration

An RL agent must balance the exploration/exploitation tradeoff: the problem of deciding whether to pursue actions that are already known to yield high rewards or explore other actions in order to discover higher rewards. RL agents usually collect data with some type of stochastic policy, such as a
Boltzmann distribution In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution Translated by J.B. Sykes and M.J. Kearsley. See section 28) is a probability distribution or probability measure that gives the probability th ...
in discrete action spaces or a Gaussian distribution in continuous action spaces, inducing basic exploration behavior. The idea behind novelty-based, or curiosity-driven, exploration is giving the agent a motive to explore unknown outcomes in order to find the best solutions. This is done by "modify ngthe loss function (or even the network architecture) by adding terms to incentivize exploration". An agent may also be aided in exploration by utilizing demonstrations of successful trajectories, or reward-shaping, giving an agent intermediate rewards that are customized to fit the task it is attempting to complete.


Off-policy reinforcement learning

An important distinction in RL is the difference between on-policy algorithms that require evaluating or improving the policy that collects data, and off-policy algorithms that can learn a policy from data generated by an arbitrary policy. Generally, value-function based methods such as
Q-learning ''Q''-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions an ...
are better suited for off-policy learning and have better sample-efficiency - the amount of data required to learn a task is reduced because data is re-used for learning. At the extreme, offline (or "batch") RL considers learning a policy from a fixed dataset without additional interaction with the environment.


Inverse reinforcement learning

Inverse RL refers to inferring the reward function of an agent given the agent's behavior. Inverse reinforcement learning can be used for learning from demonstrations (or apprenticeship learning) by inferring the demonstrator's reward and then optimizing a policy to maximize returns with RL. Deep learning approaches have been used for various forms of imitation learning and inverse RL.


Goal-conditioned reinforcement learning

Another active area of research is in learning goal-conditioned policies, also called contextual or universal policies \pi(a, s, g) that take in an additional goal g as input to communicate a desired aim to the agent. Hindsight experience replay is a method for goal-conditioned RL that involves storing and learning from previous failed attempts to complete a task. While a failed attempt may not have reached the intended goal, it can serve as a lesson for how achieve the unintended result through hindsight relabeling.


Multi-agent reinforcement learning

Many applications of reinforcement learning do not involve just a single agent, but rather a collection of agents that learn together and co-adapt. These agents may be competitive, as in many games, or cooperative as in many real-world multi-agent systems. Multi-agent reinforcement learning studies the problems introduced in this setting.


Generalization

The promise of using deep learning tools in reinforcement learning is generalization: the ability to operate correctly on previously unseen inputs. For instance, neural networks trained for image recognition can recognize that a picture contains a bird even it has never seen that particular image or even that particular bird. Since deep RL allows raw data (e.g. pixels) as input, there is a reduced need to predefine the environment, allowing the model to be generalized to multiple applications. With this layer of abstraction, deep reinforcement learning algorithms can be designed in a way that allows them to be general and the same model can be used for different tasks. One method of increasing the ability of policies trained with deep RL policies to generalize is to incorporate
representation learning In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature e ...
.


References

{{cite arXiv, last1=Wulfmeier, first1=Markus, last2=Ondruska, first2=Peter, last3=Posner, first3=Ingmar, date=2015, title= Maximum Entropy Deep Inverse Reinforcement Learning , class=cs.LG, eprint=1507.04888 Machine learning algorithms Reinforcement learning Deep learning